range-diff: add configurable memory limit for cost matrix #1958

pcasaretto · 2025-08-22T08:08:45Z

Problem Description

When git range-diff is given extremely large ranges, it can result in any of the following:

- Segmentation fault due to integer overflow in array index calculations
- Excessive memory consumption leading to system hangs or OOM kills
- Poor user experience with the command appearing to hang for minutes

Reproduction Case

In Shopify's large monorepo, a range-diff command like this crashes after several minutes with a SIGBUS error:

$ git range-diff 4430f36511..316c1276c6 cb5240b6a8..2bbd292091

Range statistics:

First range: 256,783 commits
Second range: 1 commit
Total: 256,784 commits
Memory required for cost matrix: n² × 4 bytes = ~260GB
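
The estimate above can be reproduced with a few lines of C. This is only a sketch: the helper names `cost_matrix_bytes` and `bytes_to_gib_floor` are made up for illustration, and the 4-byte entry size assumes `sizeof(int) == 4`.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes needed for an n x n matrix of 4-byte costs.
 * The arithmetic is done in 64 bits and cannot overflow for n below 2^31.
 */
static uint64_t cost_matrix_bytes(uint64_t n)
{
	return n * n * UINT64_C(4);
}

/* Hypothetical helper: whole GiB (2^30 bytes), rounded down. */
static uint64_t bytes_to_gib_floor(uint64_t bytes)
{
	return bytes >> 30;
}
```

For n = 256,784 this gives 263,752,090,624 bytes, which is about 264 GB (245 GiB) and consistent with the rough ~260GB figure.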

Stack Trace (Segmentation Fault)

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e000ae3b8)
  * frame #0: 0x000000010029a284 git`get_correspondences(a=0x000000016fde6188, b=0x000000016fde6160, creation_factor=60) at range-diff.c:356:20
    frame #1: 0x0000000100299310 git`show_range_diff(range1="4430f36511cbacf5c517c6851e2b8508a72dfd30..316c1276c63f55ad9413fa18bf3b6483564a9cf4", range2="cb5240b6a8ba59b4a1f282559ee0742721b0cafc..2bbd292091e376d177ce62e264dae8872ca6be5a", range_diff_opts=0x000000016fde6308) at range-diff.c:593:3
    frame #2: 0x00000001000c719c git`cmd_range_diff(argc=2, argv=0x0000600000bcd8c0, prefix="areas/core/shopify/", repo=0x0000000100468b20) at range-diff.c:167:8
    frame #3: 0x000000010000277c git`run_builtin(p=0x00000001004408d8, argc=3, argv=0x0000600000bcd8c0, repo=0x0000000100468b20) at git.c:480:11
    frame #4: 0x0000000100001020 git`handle_builtin(args=0x000000016fde6c70) at git.c:746:9
    frame #5: 0x0000000100002074 git`run_argv(args=0x000000016fde6c70) at git.c:813:4
    frame #6: 0x0000000100000d3c git`cmd_main(argc=3, argv=0x000000016fde7350) at git.c:953:19
    frame #7: 0x000000010012750c git`main(argc=4, argv=0x000000016fde7348) at common-main.c:9:11
    frame #8: 0x000000018573ab98 dyld`start + 6076

Root Cause Analysis

The crash occurs in get_correspondences() at line 356:

static void get_correspondences(struct string_list *a, struct string_list *b, ...)
{
    int n = a->nr + b->nr;  // n = 256,784 fits in an int, but n * n does not
    ...
    ALLOC_ARRAY(cost, st_mult(n, n));  // Would allocate ~260GB
    ...
    cost[i + n * j] = c;  // Line 356: Invalid memory access
}

Problems:

- Integer overflow: n = 256,784 fits in an int, but n * n does not, so the int index expression i + n * j overflows
- Memory allocation: Even with proper types, allocating 260GB is impractical
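
To illustrate the overflow, the following sketch checks whether the largest index into an n×n matrix still fits in an `int`, and shows the same index expression evaluated in `size_t`. The function names are hypothetical and this is not the code from range-diff.c. The numbers quoted afterwards assume a 32-bit `int`.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if every index i + n * j (0 <= i, j < n)
 * of an n x n matrix is representable as an int, 0 otherwise.
 * The product is computed in 64 bits so the check itself does not overflow
 * when int is 32 bits wide.
 */
static int matrix_index_fits_int(int n)
{
	int64_t largest;

	if (n <= 0)
		return 1;
	largest = (int64_t)n * n - 1;	/* reached at i = j = n - 1 */
	return largest <= INT_MAX;
}

/* The same index expression, evaluated in size_t instead of int. */
static size_t matrix_index(size_t i, size_t j, size_t n)
{
	return i + n * j;
}
```

With a 32-bit `int`, the largest n that passes the check is 46,340, far below the 256,784 in the reproduction case.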

Solution

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.
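
A minimal standalone sketch of such a check is shown here. It is not the patch itself: `cost_matrix_within_limit` is a hypothetical name, the overflow tests are written out by hand instead of using Git's `st_mult()`, and `uint64_t` is used so the numbers are the same on 32-bit and 64-bit builds. Rejecting sizes at or above the limit follows the `>=` comparison in the patch quoted later in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 and stores the size in *bytes_out if an
 * n x n matrix of int stays below max_bytes; returns 0 if the size reaches
 * the limit or cannot be represented in 64 bits.
 */
static int cost_matrix_within_limit(uint64_t n, uint64_t max_bytes,
				    uint64_t *bytes_out)
{
	uint64_t cells, bytes;

	if (n && n > UINT64_MAX / n)
		return 0;		/* n * n would overflow */
	cells = n * n;
	if (cells > UINT64_MAX / sizeof(int))
		return 0;		/* cells * sizeof(int) would overflow */
	bytes = cells * sizeof(int);
	if (bytes >= max_bytes)
		return 0;		/* at or over the limit: refuse */
	if (bytes_out)
		*bytes_out = bytes;
	return 1;
}
```

Assuming `sizeof(int) == 4`, a 4 GiB limit (`UINT64_C(4) << 30`) accepts n up to 32,767 and rejects n = 256,784.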

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB on 64-bit
systems and 2GB on 32-bit systems. This allows comparing ranges totalling
approximately 32,000 commits (23,000 on 32-bit systems) - generous for
real-world use cases while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
git range-diff --max-memory=1G branch1...branch2
git range-diff --max-memory=500M base..topic1 base..topic2
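
How a value such as "1G" or "500M" turns into a byte count can be sketched as follows. This is an illustration only, written in standard C rather than with Git's internal parser, and `parse_size_sketch` is a hypothetical name. It assumes 1024-based multipliers for the k/m/g suffixes and accepts them in either case; anything else after the digits is rejected.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<digits>[k|m|g]" into a byte count.
 * Returns 0 on success and -1 on malformed input or overflow.
 */
static int parse_size_sketch(const char *arg, uint64_t *out)
{
	char *end;
	unsigned long long val;
	uint64_t factor = 1;

	if (!arg || !isdigit((unsigned char)*arg))
		return -1;	/* rejects empty input, signs and whitespace */
	errno = 0;
	val = strtoull(arg, &end, 10);
	if (errno == ERANGE)
		return -1;	/* the number itself is too large */

	switch (tolower((unsigned char)*end)) {
	case 'k': factor = UINT64_C(1) << 10; end++; break;
	case 'm': factor = UINT64_C(1) << 20; end++; break;
	case 'g': factor = UINT64_C(1) << 30; end++; break;
	case '\0': break;	/* no suffix: plain bytes */
	default: return -1;	/* unknown suffix */
	}
	if (*end != '\0')
		return -1;	/* trailing characters after the suffix */
	if (val > UINT64_MAX / factor)
		return -1;	/* val * factor would overflow */
	*out = (uint64_t)val * factor;
	return 0;
}
```

Under these assumptions "1G" parses to 1,073,741,824 bytes and "500M" to 524,288,000 bytes.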

This approach was chosen over alternatives:

- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/[email protected]/

[Acked-by: Johannes Schindelin [email protected]](#1958 (comment))

dscho

Thank you for this patch! It is reasonable, I just have one suggestion on how to improve it.

Please note that there has been a highly over-engineered attempt at addressing this problem before: https://lore.kernel.org/git/[email protected]/t/#me423268c4f14a0d37c0ac3e83dc7d5e9cea3661a. You probably want to mention this in the "cover letter" (i.e. in the initial PR comment that will be sent), even though that patch series' contributor seems to be AWOL for years already.

range-diff.c

pcasaretto · 2025-08-22T12:48:23Z

Thank you @dscho for the thoughtful review!

I attempted to implement your suggestion of checking content size within read_patches(), but discovered an issue:

read_patches() currently buffers the entire output of git log -p into memory before processing:

  if (strbuf_read(&contents, cp.out, 0) < 0) {  // Line 87 - reads ALL output
      error_errno(_("could not read `log` output"));
      ...
  }
  // Only AFTER reading everything do we process line by line
  for (; size > 0; size -= len, line += len) {
      // Checking limits here is too late - memory already consumed
  }

For the test case with 256k commits, this means ~6GB is read into the contents strbuf before any limits can be checked. By the time we could check content size or commit count in the loop, the memory is already exhausted.

To properly implement early exit as you suggested, we would need to:

- Refactor read_patches() to process the git log output in a streaming fashion
- Read and process line-by-line from the pipe instead of buffering everything
- Check limits during streaming
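
The three steps above could look roughly like the following standalone sketch. It is not a patch against read_patches(): `scan_log_streaming` and its two limits are hypothetical, it reads from a plain `FILE *` rather than Git's child-process pipe, and it assumes each commit in the log output starts with a line beginning with "commit ".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read log output chunk by chunk and stop as soon as
 * either limit is exceeded, so the full output is never held in memory.
 * Returns 0 and stores the commit count on success, -1 if a limit is hit.
 */
static int scan_log_streaming(FILE *in, size_t max_commits, size_t max_bytes,
			      size_t *commits_out)
{
	char buf[4096];
	size_t commits = 0, bytes = 0;
	int at_line_start = 1;

	while (fgets(buf, sizeof(buf), in)) {
		size_t len = strlen(buf);

		if (len > max_bytes - bytes)
			return -1;	/* byte budget exceeded */
		bytes += len;

		/* Long lines arrive in several chunks; only test the first. */
		if (at_line_start && !strncmp(buf, "commit ", 7)) {
			if (commits == max_commits)
				return -1;	/* too many commits */
			commits++;
		}
		at_line_start = len > 0 && buf[len - 1] == '\n';
	}
	*commits_out = commits;
	return 0;
}
```

The early return is the point of the design: the caller learns that the range is too large after reading at most `max_bytes` of output instead of after buffering all of it.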

Would you prefer to:

- Keep this simpler fix that at least prevents the crash (two passes but prevents the memory issue)
- Attempt the more complex streaming refactor

I'll also reference the previous RFC attempt as you suggested.

dscho · 2025-08-22T13:00:49Z

@pcasaretto wow, thorough work! Personally, I would prefer the streaming approach, but I could understand if it is unreasonable to ask for such a huge refactor just to get the bug fix in. Your choice!

range-diff.c

pcasaretto · 2025-08-22T17:35:32Z

After pairing with @thehcma, we've updated the approach to address the memory exhaustion issue more directly.

Instead of pre-counting commits, we now check the actual memory requirements of the cost matrix just before allocation in get_correspondences(). This approach:

- Checks actual memory usage: We calculate n² × sizeof(int) and compare it against a configurable limit, which is more precise than just counting commits.
- Adds a --max-memory option: Users can specify memory limits using human-readable sizes (e.g., --max-memory=1G, --max-memory=500M). The option accepts standard suffixes (k/m/g) that Git users are familiar with from other commands.
- Defaults to 4GB: This allows comparing ranges of approximately 32,000 commits, which should be generous for real-world use cases while preventing impractical operations that would exhaust memory.
- Provides clear error messages: When the limit is exceeded, users see both the required and available memory in human-readable format, for example:

fatal: range-diff: unable to compute the range-diff, since it exceeds
the maximum memory for the cost matrix: 256 GiB (274877906944 bytes)
needed, 4.00 GiB (4294967296 bytes) available

This solution avoids the performance overhead of spawning additional processes while still preventing the crashes. Note that the command still takes a while to run and uses around 10GB of memory for the particular invocation that triggered the crash. As you noted, integrating this into read_patches() would be ideal, but that would require significant refactoring since it currently buffers all output before processing. Although I'm interested in attempting that, I think this is a good starting point.

What do you think about this approach, particularly:

The choice of 4GB as the default limit?
The --max-memory option name and syntax?

gitgitgadget · 2025-08-24T12:09:44Z

There are issues in commit dc9c6a6:
range-diff: add configurable memory limit for cost matrix
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
Indented lines, and lines without whitespace, are exempt

pcasaretto · 2025-08-24T14:12:04Z

Update: 4GB was too much for 32-bit systems. Made the limit 2GB in those cases.

dscho · 2025-08-26T13:13:55Z

I like your approach!

@pcasaretto please note that I am not a gate keeper here. The Git project does not accept code reviews in PRs, it requires the code review to happen on the list. In other words: Please /submit.

If you'd like, I invite you to add an "Acked-by: Johannes Schindelin [email protected]" to the commit message footer (right before your "Signed-off-by:" line) and refer to this here comment in the "cover letter", i.e. in the PR description which will be sent as part of the email to the Git mailing list.

When comparing large commit ranges (e.g., 250,000+ commits),
range-diff attempts to allocate an n×n cost matrix that can exhaust
available memory. For example, with 256,784 commits in total
(n = 256,784), the matrix would require approximately 260GB of memory
(256,784² × 4 bytes), causing either immediate segmentation faults due
to integer overflow or system hangs.

Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.

The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB on 64-bit
systems and 2GB on 32-bit systems. This allows comparing ranges totalling
approximately 32,000 commits (23,000 on 32-bit systems) - generous for
real-world use cases while preventing impractical operations.

When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.

Example usage:
  git range-diff --max-memory=1G branch1...branch2
  git range-diff --max-memory=500M base..topic1 base..topic2

This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm

This issue was previously discussed in:
https://lore.kernel.org/git/[email protected]/

Acked-by: Johannes Schindelin [email protected]
Signed-off-by: pcasaretto <[email protected]>

pcasaretto · 2025-08-26T17:17:30Z

/submit

gitgitgadget · 2025-08-26T17:18:27Z

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1958/pcasaretto/range-diff-size-limit-v1

To fetch this version to local tag pr-1958/pcasaretto/range-diff-size-limit-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1958/pcasaretto/range-diff-size-limit-v1

gitgitgadget · 2025-08-26T17:19:11Z

Error: 5cf3e89 was already submitted

gitgitgadget · 2025-08-26T19:22:18Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Paulo Casaretto via GitGitGadget" <[email protected]> writes:

> From: pcasaretto <[email protected]>

<administrivia>

It is usual to see a less human readable name embedded in the commit
object than the mail header when a mail comes from GGG.  

Just in case you want to be known to this community as "Paulo
Casaretto", not "pcasaretto", I thought I'd point it out that you
may want to redo the commit.  I do not mind what name you like to
use, as long as it is identifiable, and From: identity matches the
identity you add your Signed-off-by: with.

</administrivia>

> Acked-by: Johannes Schindelin [email protected]

It is unusual to lack <> around e-mail address here.

> Signed-off-by: pcasaretto <[email protected]>
> ---
>     range-diff: add configurable memory limit for cost matrix

> +static int parse_max_memory(const struct option *opt, const char *arg, int unset)
> +{
> +	size_t *max_memory = opt->value;
> +	uintmax_t val;
> +
> +	if (unset) {
> +		return 0;
> +	}

No unnecessary {braces} around a single statement, please.

> +	if (!git_parse_unsigned(arg, &val, SIZE_MAX))
> +		return error(_("invalid max-memory value: %s"), arg);
> +
> +	*max_memory = (size_t)val;
> +	return 0;
> +}

> @@ -33,17 +51,21 @@ int cmd_range_diff(int argc,
>  		OPT_INTEGER(0, "creation-factor",
>  			    &range_diff_opts.creation_factor,
>  			    N_("percentage by which creation is weighted")),
> +		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> +				  N_("style"), N_("passed to 'git log'"), 0),
> +		OPT_BOOL(0, "left-only", &left_only,
> +			 N_("only emit output related to the first range")),
> +		OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> +			     N_("size"),
> +			     N_("maximum memory for cost matrix (default 4G)"),
> +			     parse_max_memory),
>  		OPT_BOOL(0, "no-dual-color", &simple_color,
>  			    N_("use simple diff colors")),
>  		OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
>  				  N_("notes"), N_("passed to 'git log'"),
>  				  PARSE_OPT_OPTARG),
> -		OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> -				  N_("style"), N_("passed to 'git log'"), 0),
>  		OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
>  				  N_("passed to 'git log'"), PARSE_OPT_NOARG),
> -		OPT_BOOL(0, "left-only", &left_only,
> -			 N_("only emit output related to the first range")),
>  		OPT_BOOL(0, "right-only", &right_only,
>  			 N_("only emit output related to the second range")),
>  		OPT_END()

This seems to mix unrelated changes.  Please don't.

Or if the reordering of options do have a reason to exist in _this_
commit, please justify it in your proposed log message.  Even if
there were a good reason for reordering existing options, I strongly
suspect that the change would want to be done in a separate,
preparatory-clean-up commit (i.e., making this topic a two-patch
series), because it has nothing to do with preventing inefficient
cost matrix computation from consuming too much memory, which _is_
the theme of this commit.

> diff --git a/range-diff.c b/range-diff.c
> index 8a2dcbee322..6e9b6b115e5 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -21,6 +21,7 @@
>  #include "apply.h"
>  #include "revision.h"
>  
> +

Unrelated, unexplained, and unnecessary change snuck in?  Please
proof-read the patch yourself before sending.

> @@ -287,8 +288,8 @@ static void find_exact_matches(struct string_list *a, struct string_list *b)
>  }
>  
>  static int diffsize_consume(void *data,
> -			     char *line UNUSED,
> -			     unsigned long len UNUSED)
> +			    char *line UNUSED,
> +			    unsigned long len UNUSED)

What is this change about???

>  static void get_correspondences(struct string_list *a, struct string_list *b,
> -				int creation_factor)
> +				int creation_factor, size_t max_memory)
>  {
>  	int n = a->nr + b->nr;
>  	int *cost, c, *a2b, *b2a;
>  	int i, j;
> -
> -	ALLOC_ARRAY(cost, st_mult(n, n));
> +	size_t cost_size = st_mult(n, n);
> +	size_t cost_bytes = st_mult(sizeof(int), cost_size);
> +	if (cost_bytes >= max_memory) {
> +		struct strbuf cost_str = STRBUF_INIT;
> +		struct strbuf max_str = STRBUF_INIT;
> +		strbuf_humanise_bytes(&cost_str, cost_bytes);
> +		strbuf_humanise_bytes(&max_str, max_memory);
> +		die(_("range-diff: unable to compute the range-diff, since it "
> +		      "exceeds the maximum memory for the cost matrix: %s "
> +		      "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
> +		    cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
> +	}
> +	ALLOC_ARRAY(cost, cost_size);

Nicely done.

> @@ -351,7 +363,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
>  		}
>  
>  		c = a_util->matching < 0 ?
> -			a_util->diffsize * creation_factor / 100 : COST_MAX;
> +			    a_util->diffsize * creation_factor / 100 :
> +			    COST_MAX;
>  		for (j = b->nr; j < n; j++)
>  			cost[i + n * j] = c;
>  	}

There seem to be other unrelated indentation-only changes
mixed in to the changes to this file, not just this one.

As a style fix, 

		c = a_util->matching < 0
		  ? a_util->diffsize * creation_factor / 100
		  : COST_MAX;

would be easier to follow and read, but please do not do such a
cosmetic clean-up in the same patch.  Do them in a separate
preliminary clean-up patch before the "real work".

> @@ -591,7 +605,8 @@ int show_range_diff(const char *range1, const char *range2,
>  	if (!res) {
>  		find_exact_matches(&branch1, &branch2);
>  		get_correspondences(&branch1, &branch2,
> -				    range_diff_opts->creation_factor);
> +				    range_diff_opts->creation_factor,
> +				    range_diff_opts->max_memory);
>  		output(&branch1, &branch2, range_diff_opts);
>  	}

OK.

dscho reviewed Aug 22, 2025

thehcma reviewed Aug 22, 2025

pcasaretto force-pushed the range-diff-size-limit branch from 1a92256 to daea1fe Compare August 22, 2025 17:29

pcasaretto force-pushed the range-diff-size-limit branch 7 times, most recently from cd92fde to e308b55 Compare August 24, 2025 11:14

pcasaretto changed the title ~~range-diff: add early size check to prevent long delays and crashes~~ range-diff: add configurable memory limit for cost matrix Aug 24, 2025

pcasaretto force-pushed the range-diff-size-limit branch from e308b55 to dc9c6a6 Compare August 24, 2025 12:08

pcasaretto force-pushed the range-diff-size-limit branch from dc9c6a6 to f6a1c6d Compare August 24, 2025 14:12

pcasaretto requested a review from dscho August 25, 2025 12:27

hcmaATshopify reviewed Aug 25, 2025

pcasaretto force-pushed the range-diff-size-limit branch from f6a1c6d to 90d0059 Compare August 25, 2025 18:41

pcasaretto force-pushed the range-diff-size-limit branch from 90d0059 to 5cf3e89 Compare August 26, 2025 17:13
